Abstract. - We investigated long range correlations in two literary texts, Moby Dick by
H. Melville and Grimm's Tales. The analysis is based on the calculation of entropy-like quantities
such as the mutual information for pairs of letters and the entropy, i.e. the mean uncertainty, per letter.
We further estimate the number of different subwords of a given length n. Filtering out the
contributions due to the finite length of the texts, we find correlations ranging
over a few hundred letters. We found scaling laws for the mutual information (power law decay),
for the entropy per letter (decay with the inverse square root of n) and for the word numbers
(stretched exponential growth with n and power law growth with the text length).



From a formal point of view a book may be considered as a linear string of letters. In this
respect it resembles other linear structures. Usually the strings generated by
dynamical systems show only short range correlations, except under critical conditions where,
in analogy to equilibrium phase transitions, correlations on all scales may be observed.
Recently several studies on long range correlations in DNA sequences and in human
writings have been published. The intrinsic difficulties connected with the analysis of
long range correlations in DNA led to a controversial discussion about whether the
long range structures in DNA are authentic.

This work is devoted to the investigation of long range correlations in texts. We use the
methods of entropy analysis, which were first applied to texts by Claude Shannon in 1951.
For several reasons we expect the existence of long range structures in these sequences. Since
a book is written in a unique style and according to a general plan of the author, we expect
correlations ranging from the beginning of a text up to its end.

Another strong argument for long correlations is based on the combinatorial explosion.
Uncorrelated sequences generated on an alphabet of A letters have a manifold of $N(n) = A^n$ different
subwords of length n. A subword (block) is here any combination of letters including the
space, punctuation marks and numbers. For n > 100 the number N(n) is extremely large.
Hence we must expect that only a very small subset N*(n) of the possible words appears
in a text. Restrictions of this kind are imposed by the rules of writing texts, i.e. by the rules of syntax
as well as by semantic relations, which do not allow for an arbitrary concatenation of letters
to words and of words to sentences. The problem we address here is whether the function
N*(n) follows a simple scaling law.

In earlier papers the conjecture has been made that the number of allowed subwords scales
according to a stretched exponential law

$$N^*(n) \sim \exp(c\, n^{\alpha}) \quad \text{with} \quad \alpha < 1, \quad c = \text{const.} \qquad (1)$$



The scaling rule (1) reduces the number of allowed subwords drastically ($N^*(n) \ll A^n$)
for large n. In order to describe a given string of length L using an alphabet of A letters
we introduce the following notation. Let $A_1 A_2 \ldots A_n$ be the letters of a given substring
of length n < L. Let further $p^{(n)}(A_1 \ldots A_n)$ be the probability for this substring (block).
A special case is the probability $p^{(n)}(A_1, A_n)$ to find a pair with (n - 2) arbitrary letters in between.
Then we may introduce the mutual information for two letters at distance n,

$$I(n) = \sum_{A_1, A_n} p^{(n)}(A_1, A_n) \log \frac{p^{(n)}(A_1, A_n)}{p^{(1)}(A_1)\, p^{(1)}(A_n)} \qquad (2)$$

which is closely related to the autocorrelation function [12]. Further we define the
entropy per block of length n:

$$H_n = - \sum_{A_1 \ldots A_n} p^{(n)}(A_1 \ldots A_n) \log p^{(n)}(A_1 \ldots A_n) \qquad (3)$$

The block entropy is related to the mean number of words N*(n) by

$$N^*(n) = \exp(H_n) \quad \text{(for natural logarithms)} \qquad (4)$$
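The quantities defined above can be estimated from a text by simple counting. The following is a minimal sketch, not the authors' program: the function names are hypothetical, natural logarithms are assumed, and "distance n" is taken to mean that the two letter positions differ by n, which is one possible reading of the definition.

```python
import math
from collections import Counter


def mutual_information(text: str, n: int) -> float:
    """Plug-in estimate of I(n) for letter pairs whose positions differ by n."""
    total = len(text) - n  # number of letter pairs at this separation
    pairs = Counter((text[i], text[i + n]) for i in range(total))
    first = Counter(text[i] for i in range(total))
    second = Counter(text[i + n] for i in range(total))
    info = 0.0
    for (a, b), c in pairs.items():
        # p_ab * log(p_ab / (p_a * p_b)) written with raw counts
        info += (c / total) * math.log(c * total / (first[a] * second[b]))
    return info


def block_entropy(probabilities) -> float:
    """H = -sum p * log(p) over the probabilities of all blocks of one length."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def effective_word_number(probabilities) -> float:
    """Effective number of words exp(H), assuming natural logarithms."""
    return math.exp(block_entropy(probabilities))
```

For four equally probable blocks, `effective_word_number([0.25] * 4)` is 4 up to rounding, which illustrates why exp(H) can be read as a number of words.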



As shown already by Shannon, $H_n/n$ is an important quantity expressing the structure of
sequences. In earlier work we assumed the scaling

$$H_n/n = h + g \cdot n^{\mu_0 - 1} + e/n, \qquad 0 < \mu_0 < 1, \quad n \to \infty \qquad (5)$$



Here h, the limit of the mean uncertainty, is called the entropy of the source. This quantity
is positive for stochastic as well as for chaotic processes; g and e are constants. If h, e > 0 and
g = 0, the correlations in the string are short range, corresponding to a Markov process with a
finite memory. For periodic strings one finds h = g = 0, e > 0. The existence of long
range order in strings may be characterized by the condition g > 0, describing a slowly decaying
contribution to the asymptotics of the entropy per letter for large n. Of special interest for
the further consideration of texts is the case h ≪ 1, g > 0, corresponding to a power law tail
of the entropy per letter decaying slower than 1/n. It would be interesting and important to estimate
the limit entropy h for "homogeneous texts"; however, there are not enough data to do so with
sufficient reliability. In the following we assume that h ≈ 0.01 ... 0.1. Therefore it may be
neglected in our investigations, which are restricted to n < 30.
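The three regimes distinguished above can be compared numerically. The sketch below evaluates the scaling form of eq. (5) for illustrative parameter sets; the function name and all numerical values are assumptions chosen for demonstration and are not taken from the texts.

```python
def entropy_per_letter(n: int, h: float, g: float, e: float, mu0: float) -> float:
    """Evaluate H_n / n = h + g * n**(mu0 - 1) + e / n for one block length n."""
    return h + g * n ** (mu0 - 1.0) + e / n


# Illustrative parameter sets only; none of these numbers come from the texts.
cases = {
    "short range (Markov-like)": dict(h=0.5, g=0.0, e=1.0, mu0=0.5),
    "periodic": dict(h=0.0, g=0.0, e=1.0, mu0=0.5),
    "long range": dict(h=0.0, g=1.0, e=0.0, mu0=0.5),
}
for name, params in cases.items():
    values = [entropy_per_letter(n, **params) for n in (1, 10, 100, 1000)]
    print(name, [round(v, 4) for v in values])
```

With g > 0 the entropy per letter approaches its limit more slowly than the 1/n term alone would allow, which is the signature of long range order used in the text.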

The mutual information is not a monotonic function of n. We define long range effects by
power law tails of the averaged mutual information I(n). Here the averaging is carried out
over a window comprising several of the typical oscillations (fluctuations). Several authors
have demonstrated that DNA sequences show slowly decaying fluctuations at large scales.
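The window averaging mentioned above can be sketched as follows. The text does not specify the window shape or width, so non-overlapping windows of a fixed width are assumed here, and the function name is hypothetical.

```python
def window_average(values, width):
    """Average consecutive non-overlapping windows of `width` values.

    A trailing remainder shorter than `width` is discarded.
    """
    return [
        sum(values[i:i + width]) / width
        for i in range(0, len(values) - width + 1, width)
    ]
```

Applied to a list of mutual information values indexed by distance, this returns one smoothed value per window; the width should span several of the oscillations one wants to suppress.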

We will apply the methods of entropy analysis to literary English represented by the books
"Moby Dick" by Melville (L ≈ 1,170,200) and Grimm's Tales (L ≈ 1,435,800). Pieces of
music may be treated in a similar way. For simplification we use an alphabet consisting
of A = 32 symbols: the lowercase letters a ... z, the marks , . ( ) # and the space; # stands
for any number. In order to get better statistics we have also used for the entropy calculations
a restricted alphabet consisting of only A = 3 letters O, M, L. O codes for vowels, M for
consonants and L for spaces and marks.
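The two encodings can be sketched as below. Several details are not fixed by the text and are assumptions of this sketch: each run of digits becomes a single '#', any run of whitespace becomes one space, characters outside the alphabet are dropped, '#' is treated as a mark in the reduced alphabet, and 'y' is counted as a consonant. The names `encode_32` and `encode_3` are hypothetical.

```python
import re

LETTERS = "abcdefghijklmnopqrstuvwxyz"
MARKS = ",.()#"
ALPHABET_32 = LETTERS + MARKS + " "  # 26 letters + 5 marks + space = 32 symbols


def encode_32(raw: str) -> str:
    """Map raw text onto the 32-symbol alphabet."""
    text = raw.lower()
    text = re.sub(r"\d+", "#", text)  # assumption: one '#' per run of digits
    text = re.sub(r"\s+", " ", text)  # assumption: whitespace runs become one space
    return "".join(ch for ch in text if ch in ALPHABET_32)  # assumption: drop the rest


def encode_3(text32: str) -> str:
    """Reduce 32-symbol text to O (vowel), M (consonant), L (space or mark)."""
    out = []
    for ch in text32:
        if ch in "aeiou":  # assumption: 'y' counts as a consonant
            out.append("O")
        elif ch in LETTERS:
            out.append("M")
        else:
            out.append("L")
    return "".join(out)
```

For example, `encode_3(encode_32("Call me"))` turns the seven characters into the string `MOMMLMO`.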

To estimate the mutual information we count frequencies of pairs of letters at distance
n. Fig. 1 shows the mutual information calculated for Moby Dick and for Grimm's Tales
(A = 32). The results show pronounced correlations in the range n = 1 ... 25, which are
followed by a long, slowly decaying tail. The obtained values for the transinformation I(n)
become meaningless if they are smaller than the level of the fluctuations δI(n) due to the
finite length L of the text.

Fig. 1. - The mutual information calculated for Melville's Moby Dick and for Grimm's Tales (A = 32).

For our rather long texts with L > 10^6 the fluctuation level has a value of about $10^{-[...]}$. The
smoothed values for the mutual information for the range n = 25 ... 1000 may be fitted by
the scaling law $I(n) = c_1 \cdot n^{-0.[...]} + c_2$ with $c_1 = 1.5 \cdot 10^{-[...]}$, $c_2 = 1.1 \cdot 10^{-[...]}$. The constant
$c_2$ corresponds here to the level of fluctuations. Our results show that long texts exhibit pair
correlations which decay, at least up to distances of several hundred letters, according to a
power law. In the range n = 100 ... 1000, however, the fluctuations are rather strong and the
mean square deviation reaches 20 ... 40%.
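A fit of this form can be sketched with ordinary least squares on a log-log scale. This is not necessarily the fitting procedure used for the data above: the sketch assumes that the offset (the fluctuation level) is known beforehand and only determines the prefactor and the decay exponent. The function name and the symbol `beta` for the exponent are hypothetical.

```python
import math


def fit_power_law(ns, values, offset):
    """Least-squares fit of log(value - offset) = log(c1) - beta * log(n)."""
    xs, ys = [], []
    for n, v in zip(ns, values):
        if v > offset:  # points at or below the offset cannot be used on a log scale
            xs.append(math.log(n))
            ys.append(math.log(v - offset))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    c1 = math.exp(mean_y - slope * mean_x)
    return c1, -slope  # (prefactor c1, decay exponent beta)
```

On synthetic data generated from a known power law plus offset, the function recovers the prefactor and exponent; on measured data the strong fluctuations at large n would make the result sensitive to the chosen offset and range.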

For the calculation of entropies we count the frequencies of subwords, where a subword of
length n is defined as any combination of n letters. The results for n = 4, 9, 16, 25 letters in
Grimm's Tales are shown in Fig. 2 in a rank ordered representation. The structure of the rank
ordered distributions is rather similar for both texts; the lists of words, however, are of course
very different.
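A rank ordered subword distribution of this kind can be obtained by counting all overlapping subwords of length n and sorting the counts. The following is a minimal sketch with a hypothetical function name.

```python
from collections import Counter


def rank_ordered_frequencies(text: str, n: int):
    """Return the counts of all subwords of length n, most frequent first."""
    total = len(text) - n + 1  # number of overlapping subwords
    counts = Counter(text[i:i + n] for i in range(total))
    return [count for _, count in counts.most_common()]
```

The position in the returned list is the rank, so plotting the list against its index on logarithmic axes gives the rank ordered representation.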

The form of the subword distributions is distinctly not Zipf-like; it does not follow a
power law. On the contrary, with increasing n there is a tendency to form a plateau.
The effects due to finite n and the effects of finite length L tend to smooth the edges of
the distribution. The importance of length corrections for estimating the frequencies of
subwords was considered by several authors. For a deeper analysis of this problem we
refer to recent articles. Our method for the entropy analysis uses an extrapolation
of the entropy to infinite text length.

Fig. 2. - The observed rank ordered distribution of words of length n = 4, 9, 16, 25 for Grimm's Tales.

The probabilities which we need for the calculation of entropies are unknown and can only
be estimated from the frequencies $N_i(n)$ of the subwords of length n in a text of length L
containing N = L + 1 - n subwords. Introducing the observed subword frequencies into the
entropy definition leads to the observed entropies

$$H_n^{obs} = \log(N) - \frac{1}{N} \sum_i N_i(n) \cdot \log(N_i(n)) \qquad (7)$$

This is a random variable with the expectation value

$$H_n^{exp} = \langle H_n^{obs} \rangle = \log(N) - \frac{1}{N} \sum_i \langle N_i(n) \cdot \log(N_i(n)) \rangle \qquad (8)$$
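The finite length effect on the observed entropy can be seen directly on an artificial sequence. The sketch below implements the observed entropy with natural logarithms and applies it to an uncorrelated random 3-letter string, for which the true block entropy is n log 3; the random test string, its length and the seed are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def observed_entropy(text: str, n: int) -> float:
    """Observed block entropy log(N) - (1/N) * sum_i N_i * log(N_i), N = L + 1 - n."""
    total = len(text) + 1 - n
    counts = Counter(text[i:i + n] for i in range(total))
    return math.log(total) - sum(c * math.log(c) for c in counts.values()) / total


rng = random.Random(0)  # fixed seed so the illustration is reproducible
sample = "".join(rng.choice("OML") for _ in range(2000))
for n in (2, 6, 10):
    observed = observed_entropy(sample, n)
    true_value = n * math.log(3)  # exact H_n of an uncorrelated 3-letter source
    ceiling = math.log(len(sample) + 1 - n)  # the observed value can never exceed log(N)
    print(n, round(observed, 3), round(true_value, 3), round(ceiling, 3))
```

Once the number of possible subwords exceeds the number of subwords in the sample, the observed entropy saturates near log(N) and falls below the true value, which is why length corrections are needed.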



Assuming a Bernoulli distribution for the letter combinations, the mean values can be calculated
explicitly. The result is

[...] if $N^*(n) \ll N$,
[...] if $N^*(n) \gg N$. (9)

The latter case corresponds to the situation that only a minor part of the possible subwords
of length n have a chance to appear in the text.

The relation between the effective number of words N*(n) and the block entropy $H_n$ is given
by eq. (4). Hence the expected block entropy may be represented as a function of log N with
one free parameter $H_n$, which is found by fitting the curves. In this way the block entropies
for both books were calculated up to n = 26. For small word lengths, i.e. n < 16 for A = 3 and
n < 5 for A = 32, we used the approximation (9) for N*(n) ≪ N. For larger n, i.e. n > 20
for A = 3 and n > 10 for A = 32, we applied the approximation valid for N*(n) ≫ N.
More concretely, we measured the deviation between the observed entropy and log(N). Then
the theoretical entropy $H_n$ was estimated from N*(n) using eq. (4). In the intermediate region
we applied a smooth Padé approximation between both formulae. In a procedure of successive
approximations the entropy $H_n$ was considered as a free parameter which was fitted in such a
way that $H_n^{exp}(\log N)$ came as close as possible to the measured (observed) entropy values. In
practice this method breaks down for n > 30 if A = 3 and for n > 25 if A = 32. Longer subwords
do not have a chance to appear several times in the text, which leads to large statistical errors.

The calculations for n < 26 show that the square root law yields a reasonable approximation
for the scaling of the entropy per letter with the word length n:

$$H_n/(n \cdot \log(A)) \approx (4.84/\sqrt{n}) - (7.57/n) \qquad (A = 3)$$

$$H_n/(n \cdot \log(A)) \approx (0.9/\sqrt{n}) + (1.7/n) \qquad (A = 32) \qquad (10)$$

Fig. 3 shows the fit for the alphabet A = 3. The scaling law of the square root type was first
found by Hilberg by fitting Shannon's original data. For n = 100 and A = 32 our scaling
formula yields $H_{100} \approx 10 \cdot \log(A)$, which is not far from Shannon's estimate $H_{100} \approx 40$ bits.




Fig. 3. - The scaling behavior of the block entropy $H_n$ with the square root of the word length n for
Moby Dick encoded by the alphabet A = 3. (Horizontal axis: sqrt(word length).)



The number of subwords increases according to a stretched exponential law. For the growth
we found the approximation

$$N^* \approx 2^{7.45\sqrt{n} - 11.3} \qquad (A = 3) \qquad (11)$$

$$N^* \approx 2^{4.5\sqrt{n} + 8.5} \qquad (A = 32). \qquad (12)$$
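The link between the entropy fit for A = 32 and the growth law for A = 32 can be checked by direct substitution. The following is a sketch of that arithmetic, assuming that the block entropy is measured in bits so that the effective number of words is $2^{H_n}$; this reading is an assumption of the sketch rather than a statement from the text.

```latex
\begin{align*}
\frac{H_n}{n \log_2 A} &\approx \frac{0.9}{\sqrt{n}} + \frac{1.7}{n},
  \qquad A = 32,\ \log_2 A = 5 \\
H_n &\approx 5\,\bigl(0.9\sqrt{n} + 1.7\bigr) = 4.5\sqrt{n} + 8.5 \ \text{bits} \\
N^*(n) &= 2^{H_n} \approx 2^{4.5\sqrt{n} + 8.5},
  \qquad N^*(100) \approx 2^{53.5}
\end{align*}
```

The same substitution for A = 3, with $\log_2 3 \approx 1.585$, gives $7.67\sqrt{n} - 12.0$ in the exponent, which is close to but not identical with the coefficients quoted for A = 3.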

We summarize now the results obtained for the two books. The scaling of the mutual information
and the entropy per letter shows, in agreement with earlier work, that long texts are
neither periodic nor chaotic but somewhere in between. Taking into account length corrections,
we calculated block entropies up to n = 26 and mutual information values up to distances of
a few hundred letters. Based on these data we formulated a hypothesis about the long range
scaling. For the range n > 100 the pair correlations contained in the transinformation of long
texts (L > 10^6) decay according to a power law; however, the level of fluctuations is rather high.
A reliable estimation of the block entropies for n > 30 is still an open question. The results for
the entropy of the two books suggest, in agreement with Shannon's data and Hilberg's findings,
that the mean entropy per letter decays to its limit according to a square root law. As a
consequence the number of different subwords in texts increases with the number of letters n
according to a stretched exponential law. Our estimates for the growth yield for n = 100 a
total number of about $2^{[...]}$ different subwords. Most of the subwords which would be possible
from the combinatorial point of view are actually forbidden and do not appear in real texts.
We also investigated how the number of genuine English words N(L) (formally defined here
as sequences of letters between spaces and/or marks) increases with the length L of a text.
For Grimm's Tales we found the scaling law $N(L) = 22.8 \cdot L^{[...]}$, i.e. while reading the book we
keep encountering new words.
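The growth of the vocabulary with text length can be measured by scanning the text once and recording the number of distinct words at regular intervals. The sketch below follows the formal word definition given above (maximal runs of letters); the function name and the sampling step are assumptions.

```python
import re


def vocabulary_growth(text: str, step: int):
    """Return (L, N(L)) pairs: distinct words completed within the first L characters."""
    seen = set()
    points = []
    next_mark = step
    for match in re.finditer(r"[a-z]+", text.lower()):
        # record the vocabulary size at every mark passed before this word ends
        while match.end() > next_mark:
            points.append((next_mark, len(seen)))
            next_mark += step
        seen.add(match.group())
    while next_mark < len(text):  # marks that lie after the last word
        points.append((next_mark, len(seen)))
        next_mark += step
    points.append((len(text), len(seen)))
    return points
```

A straight-line fit of log N(L) against log L over these points would then give an estimate of the prefactor and the exponent of a power law of the kind quoted above.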

More empirical data on long texts and further studies of the statistical effects due to finite 
length of the samples are needed in order to reach a more definite conclusion about the scaling 
properties. 